The summary is organized into the following chapters:
1. Chapter 1: Introduction of Project & SMART Questions
2. Chapter 2: Previous Research
3. Chapter 3: Description of the Data set
4. Chapter 4: EDA of Observations of Wildfires
5. Chapter 5: EDA of Annual and Monthly Statistics
6. Chapter 6: Hypothesis Testing
7. Chapter 7: Correlation and Linear Model
8. Chapter 8: Conclusion

Chapter 1: Introduction of Project & SMART Questions

California continues to experience longer wildfire seasons due to climate change. In the past couple of years, as the world navigated through the COVID-19 outbreak, Californians also had to deal with large wildfires. With rising concerns over climate change and its impact on our lives, our research aimed to study the causes and effects of wildfires and increase awareness of the issue. With an abundance of environmental data out there, we hoped to use some of it to come up with recommendations in terms of budget allocation and identify particular pain points.

Listed below are our SMART Questions. Due to the nature of our data, we wanted to investigate the conditions and causes of each wildfire to help figure out potential mitigations. We also wanted to look at how the budget played a factor in the frequency/intensity of the wildfires. With this initial look into how different variables play a factor into wildfires, we hoped to understand a bit more about what causes them and what recommendations to propose for mitigating them.

  1. Did the annual number of wildfires in California increase with an increase in the temperature between 1992-2015?

  2. Did higher precipitation levels result in a smaller or fewer number of wildfires in California between 1992-2015?

  3. Did a lower fire suppression expenditure in California result in larger wildfires between 1992-2015?

  4. Did lower levels of soil moisture in California result in larger or more frequent wildfires between 1992-2015?

Chapter 2: Previous Research

One of the first papers we looked at was “Climate change and growth scenarios for California wildfire,” which discusses how population growth may affect the frequency and intensity of wildfires. According to the authors, Westerling, Bryant, and Preisler, human-induced climate change and the increasing population in California are likely to directly affect large wildfires in the area. The research studied different scenarios for future population growth and the wildland-urban interface relative to housing density, and assessed the results for thirty years. In every scenario, the anticipated burned area grew over time if human-induced climate change continues at the same intensity.

In the next paper, “A Framework for Risk Assessment and Optimal Line Upgrade Selection to Mitigate Wildfire Risk,” the authors Sofia Taylor and Line A. Roald argue that overhead power lines pose a risk of igniting wildfires, and they outline the factors that determine whether overhead lines should be converted to underground cables to mitigate that risk. They developed a model that weighs these factors for different overhead lines and recommends which to put underground. This is relevant to our topic because we investigated the number of wildfires caused by equipment use, which ranked higher than we anticipated.

Finally, we read “Sensor-Based Satellite IoT for Early Wildfire Detection,” in which How-Hang Liu, Ronald Y. Chang, Yi-Ying Chen, and I-Kang Fu propose using IoT (Internet of Things) sensors to detect wildfires by monitoring wind speed, soil wetness, biomass, and other factors to identify when a fire has started. We likewise investigated soil moisture to see whether it contributed to the intensity or frequency of wildfires.

References

Westerling, A.L., Bryant, B.P., Preisler, H.K. et al. Climate change and growth scenarios for California wildfire. Climatic Change 109, 445–463 (2011). https://doi.org/10.1007/s10584-011-0329-9

Taylor, S., Roald, L.A. A Framework for Risk Assessment and Optimal Line Upgrade Selection to Mitigate Wildfire Risk. arXiv (2021). https://arxiv.org/abs/2110.07348

Liu, H.-H., Chang, R.Y., Chen, Y.-Y., Fu, I.-K. Sensor-Based Satellite IoT for Early Wildfire Detection. arXiv (2021). https://arxiv.org/abs/2109.10505

# Load the merged wildfire records, the monthly summaries, and the budget table
WildfireData <- read.csv("final_wildfire.csv")
summary_nature <- read.csv("summary_nature.csv")
summary_peoplecaused <- read.csv("summary_peoplecaused.csv")
fire_budget <- read.csv("fire_suppression.csv")

# Convenience vectors for the three climate variables
Avg_Temp <- WildfireData$tair_day_livneh_vic
Avg_SoilMoisture <- WildfireData$soilmoist1_day_livneh_vic
Avg_Rainfall <- WildfireData$rainfall_day_livneh_vic

Chapter 3: Description of Dataset

The first dataset we found was composed of 1.88 million wildfires across the United States as recorded by the Fire Program Analysis (FPA) system. We needed a dataset that met the size requirement, and we also wanted a problem with enough records to support wide-reaching analysis. Thus, we settled on analyzing wildfires across California from 1992 to 2013. However, this dataset alone would not give us the level of detail that we hoped for. For example, most of its columns described the same kinds of characteristics, such as months, years, days, unit numbers, and unit IDs. For our analysis, we required more data about the conditions during these fires so that we could suggest ways to mitigate future wildfires under the same conditions.

For that reason, we drew from other datasets in hopes of a more comprehensive analysis. Luckily, an organization called Cal-Adapt collects different peer-reviewed datasets about California from various data publishers and provides them for download in one location. When looking at the available data, three features stood out: soil moisture, air temperature, and rainfall. Naturally, we felt that these metrics were important factors in determining whether or not a wildfire will occur and how large it will be. All three of these datasets came in daily form, so we had to aggregate them by month and year if we wanted to properly incorporate them into our main dataset with all the wildfires.
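The daily-to-monthly aggregation described above can be sketched as follows. This is a minimal illustration on made-up data, not the code used for the project: the `daily` data frame, its `tair` column, and its values are all hypothetical stand-ins for a Cal-Adapt daily series.

```r
# Toy daily series (hypothetical placeholder values, not the Cal-Adapt data)
daily <- data.frame(
  date = seq(as.Date("1992-01-01"), as.Date("1992-03-31"), by = "day")
)
daily$tair <- seq_len(nrow(daily)) / 10

# Derive the grouping keys from the date
daily$Year  <- as.integer(format(daily$date, "%Y"))
daily$month <- as.integer(format(daily$date, "%m"))

# Collapse to one row per Year/month using the mean of the daily values
monthly <- aggregate(tair ~ Year + month, data = daily, FUN = mean)
print(monthly)
```

The resulting `Year` and `month` columns can then serve as join keys against the wildfire records.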

We also hypothesized that the California state budget for fire suppression would be an interesting factor to observe over the years. Did increased/decreased spending have an impact on the frequency/intensity of wildfires over the years? We looked to answer this question in hopes of understanding how state government efforts were helping (or not) the wildfire situation in California. To incorporate this dataset into our main dataset, we merely merged them by year.
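A merge by year of this kind might look like the sketch below. The two small tables and their dollar amounts are hypothetical placeholders; only the column names `Year`, `FIRE_SIZE`, and `Budget` come from our data.

```r
# Hypothetical toy tables; the values are placeholders, not the real data
fires <- data.frame(Year = c(1992, 1992, 1993),
                    FIRE_SIZE = c(0.2, 5.0, 0.1))
budget <- data.frame(Year = c(1992, 1993),
                     Budget = c("$1,000", "$2,500"),
                     stringsAsFactors = FALSE)

# Remove "$" and "," so the budget can be stored as a number
budget$Budget <- as.numeric(gsub("[$,]", "", budget$Budget))

# Left join: keep every fire record and attach that year's budget
merged <- merge(fires, budget, by = "Year", all.x = TRUE)
print(merged)
```

Using `all.x = TRUE` keeps fire records even for years with no budget entry, which then show up as `NA` rather than silently disappearing.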

As a result, we had one comprehensive dataset with all of the features we initially wanted to look at (rainfall, air temperature, soil moisture, and California state budget) and how relevant they are to the sizes/frequencies of wildfires from 1992-2013. While there are surely many more potential causes of wildfires out there, we felt that these four factors would be a good start for some initial analysis. However, our dataset did have some limitations.

With the aforementioned datasets, we found that, while they were somewhat useful for our analysis, they weren’t as granular as we hoped they would be. For example, the dataset containing California’s yearly budget for fire suppression could have also included how the money was partitioned across different causes/uses. That way, we would be able to provide more detailed recommendations on where some funds should be redistributed. As it stands, knowing how much money is being dedicated to combating wildfires in itself is not incredibly useful.

We also noticed that our main dataset, which contained a record for every wildfire, noted the cause of each one. The causes included arson, equipment use, lightning, and many others. However, we found in our analysis that miscellaneous causes accounted for the majority of the wildfires. Even a little more detail could have helped us recommend mitigations.

In addition, the data dictionaries for our datasets regarding air temperature and soil moisture didn’t really detail what units the data was recorded in. Since we were dealing with data from California in the United States, we could only assume that Fahrenheit was used but even then, the measurements seemed a bit lower than expected. As for soil moisture, we had to look into what the common unit for this metric was, which turned out to be in bars.

Additional information that would have been useful is data about human patterns and activities and how they have potentially impacted the frequency/intensity of wildfires over the years. Studying the population growth and housing density in California over time in comparison to incidents of wildfires can provide increased understanding of the causes of human-caused wildfires.

3.1 Summary for dataset

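library(stringr)  # provides str_remove() and str_remove_all()

# Strip the "$" sign and thousands separators so Budget can be stored as a number
temp=str_remove(fire_budget$Budget,"[$]")
temp=str_remove_all(temp,"[,]")
fire_budget$Budget=as.numeric(temp)

# Same cleaning for the Budget column of the merged wildfire data
temp=str_remove(WildfireData$Budget,"[$]")
temp=str_remove_all(temp,"[,]")
WildfireData$Budget=as.numeric(temp)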

str(WildfireData)
## 'data.frame':    189550 obs. of  17 variables:
##  $ X                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year                     : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ DISCOVERY_DOY            : int  1 1 1 2 2 2 3 4 4 6 ...
##  $ Budget                   : num  85591000 85591000 85591000 85591000 85591000 ...
##  $ DISCOVERY_DATE           : num  2448622 2448622 2448622 2448624 2448624 ...
##  $ STAT_CAUSE_CODE          : int  8 5 1 8 9 9 2 9 9 7 ...
##  $ STAT_CAUSE_DESCR         : chr  "Children" "Debris Burning" "Lightning" "Children" ...
##  $ CONT_DATE                : num  NA 2448622 2448622 NA NA ...
##  $ CONT_DOY                 : int  NA 1 1 NA NA NA NA NA NA 6 ...
##  $ FIRE_SIZE                : num  0.2 5 0.1 0.1 0.2 0.5 0.1 0.1 0.1 0.1 ...
##  $ FIRE_SIZE_CLASS          : chr  "A" "B" "A" "A" ...
##  $ STATE                    : chr  "CA" "CA" "CA" "CA" ...
##  $ existDay                 : int  NA 0 0 NA NA NA NA NA NA 0 ...
##  $ tair_day_livneh_vic      : num  3.78 3.78 3.78 4.17 4.17 ...
##  $ month                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ soilmoist1_day_livneh_vic: num  20.2 20.2 20.2 20.5 20.5 ...
##  $ rainfall_day_livneh_vic  : num  0.0616 0.0616 0.0616 1.1337 1.1337 ...

Chapter 4: EDA for Wildfire Observations

4.1 Histograms

library(ggplot2)
library(gridExtra)
# Average temperature (binwidth alone sets the bins; a `bins` argument would be ignored)
TempHist <- ggplot(WildfireData, aes(x = tair_day_livneh_vic)) +
  geom_histogram(binwidth = 0.5, col = "black", fill = "lightblue2") +
  labs(x = "Avg. Temp (C)", y = "Frequency", title = "HISTOGRAM: Average Temperature")

# Average soil moisture
SoilHist <- ggplot(WildfireData, aes(x = soilmoist1_day_livneh_vic)) +
  geom_histogram(binwidth = 0.5, col = "black", fill = "orangered2") +
  labs(x = "Avg. Soil Moisture", y = "Frequency", title = "HISTOGRAM: Average Soil Moisture")

# Average rainfall
RainHist <- ggplot(WildfireData, aes(x = Avg_Rainfall <- rainfall_day_livneh_vic)) +
  geom_histogram(binwidth = 0.5, col = "black", fill = "green3") +
  labs(x = "Avg. Rainfall", y = "Frequency", title = "HISTOGRAM: Average Rainfall")

# Wildfire count by year (one bar per year, so geom_bar rather than geom_histogram)
CountHist <- ggplot(WildfireData, aes(x = Year)) +
  geom_bar(col = "black", fill = "yellow") +
  labs(x = "Years", y = "Frequency of Wildfires", title = "Wildfires count by year")

Histograms <- grid.arrange(TempHist, SoilHist, RainHist, CountHist, ncol=2, nrow=2)

ggsave("Histograms.jpg", plot = Histograms)
## Saving 7 x 5 in image

4.2 Bar Graphs

#Fire Size
FireBar <- ggplot(data = WildfireData, aes(x = FIRE_SIZE_CLASS)) +
  geom_bar(col="black", fill="orange")+
  labs(x="Fire Size Class", y="Frequency", title="Frequency of Wildfires by Size Classes") 


#Years
YearsBar <- ggplot(data = WildfireData, aes(x = Year)) +
  geom_bar(col="black", fill="yellow")+
  labs(x="Years", y="Frequency", title="Frequency of Wildfires by Year")

#Budget
BudgetBar <- ggplot(data = WildfireData, aes(x = Budget)) +
  geom_bar(col="black", fill="Pink 2")+
  labs(x="Budget", y="Frequency", title="Frequency of Wildfires by Budget")


grid.arrange(FireBar, YearsBar, nrow=2)

4.3 Pie Charts

# Count fires per size class; pie() labels the slices with the table's own names
size_counts <- table(WildfireData$FIRE_SIZE_CLASS)
pie(size_counts, col=rainbow(length(size_counts)), main="Pie Chart of Fire Size Class")

# Count fires per recorded cause
cause_counts <- table(WildfireData$STAT_CAUSE_DESCR)
pie(cause_counts, col=rainbow(length(cause_counts)), main="Pie Chart of Wildfire Cause")

4.4 Line Charts by Year

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
yearly_count <- WildfireData %>% count(Year)
colnames(yearly_count) <- c("Year", "Count")
ggplot(yearly_count, aes(x=Year, y=Count, group=1)) + geom_line() + ggtitle("Yearly Recorded Fires")

temp=str_remove(fire_budget$Budget,"[$]")
temp=str_remove_all(temp,"[,]")
fire_budget$Budget=as.numeric(temp)
ggplot(fire_budget, aes(x=Year, y=Budget, group=1)) + geom_line() + ggtitle("California Fire Suppression Budget 1979-2021")

dat <- aggregate(FIRE_SIZE ~ Year, WildfireData, mean)
ggplot(dat, aes(x=Year, y=FIRE_SIZE, group=1)) + geom_line() + ggtitle("Wildfire Sizes (1992-2013)") + ylab("Fire Size")

After the EDA, we came away with several thoughts. One of the first metrics we wanted to look into was the California yearly budget for fire suppression. Wildfires have been an issue in this state for many years, so looking at whether or not the funding for wildfire suppression was helping was an important part of our analysis. After visualizing the budget and seeing that it ballooned to over $1 billion over the past couple years, our question about whether or not increased spending helped mitigate wildfires quickly turned into: what was the money going towards? Other than that, our previous questions only got a little more granular. It was clear to see that soil moisture, rainfall, and air temperature didn’t have much effect on the frequency of wildfires from 1992-2015. But we still wanted to know if the conditions during each class and cause of fire differed and by how much.

Chapter 5: EDA for Monthly and Annual Statistics of Wildfires: Scatterplots, Box Plots, and ANOVA

final_fire=read.csv('final_wildfire.csv')
summary_nature=read.csv('summary_nature.csv')
summary_peoplecaused=read.csv('summary_peoplecaused.csv')
colnames(summary_nature)[4]='temperature'
colnames(summary_nature)[5]='soilmoisture'
colnames(summary_nature)[6]='rainfall'
colnames(summary_nature)[7]='nfire'
colnames(summary_peoplecaused)[4]='temperature'
colnames(summary_peoplecaused)[5]='soilmoisture'
colnames(summary_peoplecaused)[6]='rainfall'
colnames(summary_peoplecaused)[7]='nfire'
summary_peoplecaused$Year=as.factor(summary_peoplecaused$Year)

summary_peoplecaused$month=as.factor(summary_peoplecaused$month)
summary_nature$Year=as.factor(summary_nature$Year)

summary_nature$month=as.factor(summary_nature$month)

5.1: Plots of Annual and Monthly Summary

5.1.1: Plotting the year trend

library(ggplot2)

temp_plot=aggregate(nfire~Year,summary_nature,sum)

temp_plot2=aggregate(nfire~Year,summary_peoplecaused,sum)

ggplot() +geom_point(data=temp_plot, aes(x=Year, y=nfire), colour='blue') + geom_point(data=temp_plot2, aes(x=Year, y=nfire),colour='red')+labs(title='Number of Fires Each Year (Red for people-caused, Blue for other reasons)',y='Number of Fires')

5.1.2: Box plots of year and month to show trend

library(ggpubr)
ggplot(summary_peoplecaused, mapping=aes(x=Year,y=nfire)) + geom_boxplot()+ggtitle('box-plot of number of people-caused fires for different years')+ylab('Number of Fires')

ggplot(summary_peoplecaused, mapping=aes(x=month,y=nfire)) + geom_boxplot()+ggtitle('box-plot of number of people-caused fires for different months')+ylab('Number of Fires')

ggplot(summary_nature, mapping=aes(x=Year,y=nfire)) + geom_boxplot()+ggtitle('box-plot of number of fires caused by other reasons for different years')+ylab('Number of Fires')

ggplot(summary_nature, mapping=aes(x=month,y=nfire)) + geom_boxplot()+ggtitle('box-plot of number of fires caused by other reasons for different months')+ylab('Number of Fires')

ggplot(summary_peoplecaused, mapping=aes(x=Year,y=temperature)) + geom_boxplot()+ggtitle('box-plot of temperature for different years')+ylab('temperature')

ggplot(summary_peoplecaused, mapping=aes(x=month,y=temperature)) + geom_boxplot()+ggtitle('box-plot of temperature for different months')+ylab('temperature')

ggplot(summary_nature, mapping=aes(x=Year,y=soilmoisture)) + geom_boxplot()+ggtitle('box-plot of soil moisture for different years')+ylab('soil moisture')

ggplot(summary_nature, mapping=aes(x=month,y=soilmoisture)) + geom_boxplot()+ggtitle('box-plot of soil moisture for different months')+ylab('soil moisture')

ggplot(summary_nature, mapping=aes(x=Year,y=rainfall)) + geom_boxplot()+ggtitle('box-plot of rainfall for different years')+ylab('average daily rainfall')

ggplot(summary_nature, mapping=aes(x=month,y=rainfall)) + geom_boxplot()+ggtitle('box-plot of rainfall for different months')+ylab('average daily rainfall')

The box plots suggest that the annual summaries of all variables (number of fires, temperature, rainfall, and soil moisture) are consistent from year to year, while the monthly summaries differ considerably. To check this formally, we ran ANOVA tests on the group means for both groupings.

5.2: ANOVA test on the year and month statistics

summary(aov(nfire~Year,summary_peoplecaused))
##              Df  Sum Sq Mean Sq F value Pr(>F)
## Year         21  360393   17162   0.849  0.657
## Residuals   242 4892242   20216
summary(aov(nfire~month,summary_peoplecaused))
##              Df  Sum Sq Mean Sq F value Pr(>F)    
## month        11 4167871  378897   88.02 <2e-16 ***
## Residuals   252 1084764    4305                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(nfire~Year,summary_nature))
##              Df   Sum Sq Mean Sq F value Pr(>F)
## Year         21  1352086   64385   0.608  0.911
## Residuals   242 25626302  105894
summary(aov(nfire~month,summary_nature))
##              Df   Sum Sq Mean Sq F value Pr(>F)    
## month        11 21164133 1924012   83.39 <2e-16 ***
## Residuals   252  5814256   23072                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(temperature~Year,summary_peoplecaused))
##              Df Sum Sq Mean Sq F value Pr(>F)
## Year         21     31    1.47   0.033      1
## Residuals   242  10855   44.85
summary(aov(temperature~month,summary_peoplecaused))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## month        11  10458   950.7   560.1 <2e-16 ***
## Residuals   252    428     1.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(soilmoisture~Year,summary_nature))
##              Df Sum Sq Mean Sq F value Pr(>F)
## Year         21  129.9   6.185   0.491  0.972
## Residuals   242 3048.5  12.597
summary(aov(soilmoisture~month,summary_nature))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## month        11 2616.9  237.90   106.8 <2e-16 ***
## Residuals   252  561.5    2.23                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(rainfall~Year,summary_nature))
##              Df Sum Sq Mean Sq F value Pr(>F)  
## Year         21  34.08   1.623   1.458 0.0931 .
## Residuals   242 269.34   1.113                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(rainfall~month,summary_nature))
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## month        11  95.17   8.652   10.47 8.85e-16 ***
## Residuals   252 208.25   0.826                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For each ANOVA test, the null hypothesis is that the mean of the variable is the same across all categories, and the alternative hypothesis is that at least one category mean differs. For every variable grouped by year, we failed to reject the null hypothesis at the 0.05 level. For every variable grouped by month, we rejected the null hypothesis.

The results suggest that the variables (number of fires, temperature, rainfall, and soil moisture) are consistent from year to year but differ greatly between months.
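To make the ANOVA step concrete, here is a small self-contained sketch on simulated data. The `toy` data frame, the seasonal sine pattern, and the noise level are assumptions chosen only to mimic the shape of our monthly summaries (12 months by 22 years, 264 rows); they are not our actual counts.

```r
set.seed(1)

# Simulated monthly fire counts with a seasonal pattern and no year effect
toy <- expand.grid(month = 1:12, Year = 1992:2013)
toy$nfire <- 500 + 300 * sin((toy$month - 4) * pi / 6) +
  rnorm(nrow(toy), sd = 50)
toy$month <- factor(toy$month)
toy$Year  <- factor(toy$Year)

# One-way ANOVA for each grouping
fit_month <- aov(nfire ~ month, data = toy)
fit_year  <- aov(nfire ~ Year, data = toy)

# Pull the degrees of freedom and p-value out of the summary table
tab_month <- summary(fit_month)[[1]]
df_month  <- tab_month[["Df"]]
p_month   <- tab_month[["Pr(>F)"]][1]
p_year    <- summary(fit_year)[[1]][["Pr(>F)"]][1]

print(df_month)
print(c(month = p_month, year = p_year))
```

With 12 months and 264 rows, the month test has 11 and 252 degrees of freedom, matching the ANOVA tables above.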

Chapter 6: Hypothesis Testing

To investigate how conditions differed across fire classes and causes, we used hypothesis testing, specifically two-sample t-tests. For the first set of tests, we split the wildfires into groups by size class and compared the conditions between classes. Classes A-G describe the size of the fire, with Class A being the smallest and Class G the largest.

classA <- WildfireData[WildfireData$FIRE_SIZE_CLASS == 'A',]
classB <- WildfireData[WildfireData$FIRE_SIZE_CLASS == 'B',]
classC <- WildfireData[WildfireData$FIRE_SIZE_CLASS == 'C',]
classD <- WildfireData[WildfireData$FIRE_SIZE_CLASS == 'D',]
classE <- WildfireData[WildfireData$FIRE_SIZE_CLASS == 'E',]
classF <- WildfireData[WildfireData$FIRE_SIZE_CLASS == 'F',]
classG <- WildfireData[WildfireData$FIRE_SIZE_CLASS == 'G',]

When comparing the conditions during the smallest wildfires to the largest wildfires, it appears that air temperature was lower, soil moisture was higher, and rainfall was higher during less intense wildfires.

t.test(classA$tair_day_livneh_vic, classG$tair_day_livneh_vic, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  classA$tair_day_livneh_vic and classG$tair_day_livneh_vic
## t = -8.7583, df = 89797, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.195132 -2.026586
## sample estimates:
## mean of x mean of y 
##  18.70072  21.31158
t.test(classA$soilmoist1_day_livneh_vic, classG$soilmoist1_day_livneh_vic, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  classA$soilmoist1_day_livneh_vic and classG$soilmoist1_day_livneh_vic
## t = 8.175, df = 89797, p-value = 2.998e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8259187 1.3468515
## sample estimates:
## mean of x mean of y 
##  13.08929  12.00291
t.test(classA$rainfall_day_livneh_vic, classG$rainfall_day_livneh_vic, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  classA$rainfall_day_livneh_vic and classG$rainfall_day_livneh_vic
## t = 4.4724, df = 89797, p-value = 7.744e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1484135 0.3799739
## sample estimates:
## mean of x mean of y 
## 0.4636929 0.1994992
code1 <- WildfireData[WildfireData$STAT_CAUSE_CODE == 1,]
code2 <- WildfireData[WildfireData$STAT_CAUSE_CODE == 2,]
code7 <- WildfireData[WildfireData$STAT_CAUSE_CODE == 7,]

Next, we compare groups of wildfires categorized by their causes.

Code 1: Lightning
Code 2: Equipment Use
Code 7: Arson

When comparing wildfires caused by lightning with those caused by equipment use, average temperature, soil moisture, and rainfall in CA were significantly different. In particular, during lightning-caused wildfires, air temperature was higher, soil moisture was lower, and rainfall was higher.

t.test(code1$tair_day_livneh_vic, code2$tair_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code1$tair_day_livneh_vic and code2$tair_day_livneh_vic
## t = 85.583, df = 62967, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.936384 3.074032
## sample estimates:
## mean of x mean of y 
##  22.10035  19.09514
t.test(code1$soilmoist1_day_livneh_vic, code2$soilmoist1_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code1$soilmoist1_day_livneh_vic and code2$soilmoist1_day_livneh_vic
## t = -38.009, df = 62967, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6264540 -0.5650133
## sample estimates:
## mean of x mean of y 
##  12.19398  12.78972
t.test(code1$rainfall_day_livneh_vic, code2$rainfall_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code1$rainfall_day_livneh_vic and code2$rainfall_day_livneh_vic
## t = 30.919, df = 62967, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2152689 0.2444081
## sample estimates:
## mean of x mean of y 
## 0.5694914 0.3396529

When comparing the wildfires caused by Lightning versus those caused by Arson, it appears that the air temperature, soil moisture, and average rainfall in CA were significantly different. In particular, during lightning-caused wildfires, air temperature was higher, soil moisture was lower, and average rainfall was higher.

t.test(code1$tair_day_livneh_vic, code7$tair_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code1$tair_day_livneh_vic and code7$tair_day_livneh_vic
## t = 82.448, df = 43091, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.208857 3.365139
## sample estimates:
## mean of x mean of y 
##  22.10035  18.81335
t.test(code1$soilmoist1_day_livneh_vic, code7$soilmoist1_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code1$soilmoist1_day_livneh_vic and code7$soilmoist1_day_livneh_vic
## t = -39.235, df = 43091, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7279409 -0.6586716
## sample estimates:
## mean of x mean of y 
##  12.19398  12.88729
t.test(code1$rainfall_day_livneh_vic, code7$rainfall_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code1$rainfall_day_livneh_vic and code7$rainfall_day_livneh_vic
## t = 26.917, df = 43091, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2019315 0.2336487
## sample estimates:
## mean of x mean of y 
## 0.5694914 0.3517013

When looking at conditions during arson-caused wildfires versus equipment use-caused wildfires, it appears that air temperature and soil moisture were significantly different, while the difference in rainfall was not significant (p = 0.16). In particular, during arson-caused fires, air temperature was lower and soil moisture was higher.

t.test(code7$tair_day_livneh_vic, code2$tair_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code7$tair_day_livneh_vic and code2$tair_day_livneh_vic
## t = -6.3054, df = 56768, p-value = 2.895e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3693827 -0.1941972
## sample estimates:
## mean of x mean of y 
##  18.81335  19.09514
t.test(code7$soilmoist1_day_livneh_vic, code2$soilmoist1_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code7$soilmoist1_day_livneh_vic and code2$soilmoist1_day_livneh_vic
## t = 4.7886, df = 56768, p-value = 1.684e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05763517 0.13750998
## sample estimates:
## mean of x mean of y 
##  12.88729  12.78972
t.test(code7$rainfall_day_livneh_vic, code2$rainfall_day_livneh_vic, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  code7$rainfall_day_livneh_vic and code2$rainfall_day_livneh_vic
## t = 1.3998, df = 56768, p-value = 0.1616
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.004821767  0.028918604
## sample estimates:
## mean of x mean of y 
## 0.3517013 0.3396529
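
The t-tests in this chapter use `var.equal = TRUE`, which pools the variance of the two groups. When group sizes and spreads differ a lot, Welch's t-test (R's default) is a common alternative. The sketch below compares the two on simulated data. The group means are borrowed from the Class A versus Class G temperature test, but the sample sizes and standard deviations are assumptions made up for illustration.

```r
set.seed(42)

# Simulated temperatures: a large group and a much smaller one
# (sizes and standard deviations are hypothetical)
small_fires <- rnorm(5000, mean = 18.7, sd = 7)
large_fires <- rnorm(50,   mean = 21.3, sd = 5)

# Pooled-variance test, as used in this chapter
pooled <- t.test(small_fires, large_fires, var.equal = TRUE)

# Welch's test, which does not assume equal variances
welch <- t.test(small_fires, large_fires)

# Compare degrees of freedom and p-values side by side
print(c(pooled_df = unname(pooled$parameter),
        welch_df  = unname(welch$parameter)))
print(c(pooled_p = pooled$p.value, welch_p = welch$p.value))
```

The pooled test always has n1 + n2 - 2 degrees of freedom, while Welch's approximation is driven largely by the smaller group, so its degrees of freedom are lower here.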

Chapter 7: Correlation and Linear Regression Model

7.1: Correlation Matrix

temp=str_remove(summary_nature$Budget,"[$]")
temp=str_remove_all(temp,"[,]")
summary_nature$Budget=as.numeric(temp)
temp=str_remove(summary_peoplecaused$Budget,"[$]")
temp=str_remove_all(temp,"[,]")
summary_peoplecaused$Budget=as.numeric(temp)
cor_nature=cor(summary_nature[c(4:9)])

library(corrplot)
## corrplot 0.92 loaded
cor_people=cor(summary_peoplecaused[c(4:9)])



colnames(summary_nature)[7]='nature_fire'
summary_nature$people_caused_fires=summary_peoplecaused$n
summary_nature$total=summary_nature$nature_fire+summary_peoplecaused$n
cor_total=cor(summary_nature[c(4,5,6,9,7,10,11)])


corrplot(cor_total,method='number',type = 'lower', diag = TRUE)

In the EDA, we evaluated wildfires in two groups: people-caused fires and fires with other causes (usually natural). The correlation matrix suggests that the number of fires in both groups is strongly correlated with temperature and soil moisture. This implies that we could build a model to evaluate the influence of natural factors and answer our SMART questions.
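As a rough illustration of how such a correlation matrix is built and read, the sketch below uses simulated monthly values. The `toy` data frame and the relationships baked into it are assumptions for demonstration only, not our measurements.

```r
set.seed(7)

# Simulated monthly data: warm months are drier and have more fires
n <- 264
temperature  <- 15 + 10 * sin(seq_len(n) * pi / 6) + rnorm(n, sd = 1)
soilmoisture <- 25 - 0.5 * temperature + rnorm(n, sd = 1)
total        <- 50 * temperature - 20 * soilmoisture + rnorm(n, sd = 100)
toy <- data.frame(temperature, soilmoisture, total)

# Pairwise Pearson correlations between all columns
cor_mat <- cor(toy)
print(round(cor_mat, 2))
```

Each off-diagonal entry is the correlation between two variables; a strong correlation between two predictors (here temperature and soil moisture) is a sign that multicollinearity should be checked before both are used in a regression.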

7.2: Data Modelling

We fit models of the number of fires against the highly correlated variables, checked their summaries, and used VIF to decide which variables to keep.

7.2.1: Using residual plots and Q-Q plots to check normality

model1=lm(total~temperature,data=summary_nature)
summary(model1)
## 
## Call:
## lm(formula = total ~ temperature, data = summary_nature)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -487.32 -156.40  -17.98  115.94  786.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -364.702     31.780  -11.48   <2e-16 ***
## temperature   60.159      2.077   28.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216.7 on 262 degrees of freedom
## Multiple R-squared:  0.7621, Adjusted R-squared:  0.7612 
## F-statistic: 839.3 on 1 and 262 DF,  p-value: < 2.2e-16
model2=lm(total~temperature+soilmoisture,data=summary_nature)
plot(model2)

summary(model2)
## 
## Call:
## lm(formula = total ~ temperature + soilmoisture, data = summary_nature)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -451.32 -143.28  -25.17  113.08  777.87 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   210.385    160.014   1.315 0.189733    
## temperature    48.050      3.878  12.389  < 2e-16 ***
## soilmoisture  -26.296      7.178  -3.664 0.000301 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.7 on 261 degrees of freedom
## Multiple R-squared:  0.7737, Adjusted R-squared:  0.772 
## F-statistic: 446.2 on 2 and 261 DF,  p-value: < 2.2e-16
model3=lm(total~temperature+soilmoisture+rainfall,data=summary_nature)
summary(model3)
## 
## Call:
## lm(formula = total ~ temperature + soilmoisture + rainfall, data = summary_nature)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -436.00 -146.77  -19.03  108.12  770.88 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   295.033    171.473   1.721 0.086517 .  
## temperature    47.261      3.915  12.070  < 2e-16 ***
## soilmoisture  -32.356      8.441  -3.833 0.000159 ***
## rainfall       22.824     16.798   1.359 0.175424    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.4 on 260 degrees of freedom
## Multiple R-squared:  0.7753, Adjusted R-squared:  0.7727 
## F-statistic: 299.1 on 3 and 260 DF,  p-value: < 2.2e-16
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model3)
##  temperature soilmoisture     rainfall 
##     3.735761     5.069232     1.916655
vif(model2)
##  temperature soilmoisture 
##     3.653691     3.653691
# Model 4: soil moisture alone, for comparison
model4 <- lm(total ~ soilmoisture, data = summary_nature)

summary(model4)
## 
## Call:
## lm(formula = total ~ soilmoisture, data = summary_nature)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -592.40 -176.14  -37.79  118.51  957.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2050.416     74.892   27.38   <2e-16 ***
## soilmoisture -102.079      4.723  -21.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 266.3 on 262 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 467.1 on 1 and 262 DF,  p-value: < 2.2e-16
# Model 5: log-transform the fire count to address the curved residuals
model5 <- lm(log(total) ~ temperature + soilmoisture, data = summary_nature)
residualPlot(model5)

plot(model5)

summary(model5)
## 
## Call:
## lm(formula = log(total) ~ temperature + soilmoisture, data = summary_nature)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92313 -0.27901 -0.02392  0.23857  1.31414 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.44582    0.30364  27.815  < 2e-16 ***
## temperature   0.05834    0.00736   7.926 6.59e-14 ***
## soilmoisture -0.23920    0.01362 -17.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4017 on 261 degrees of freedom
## Multiple R-squared:  0.8949, Adjusted R-squared:  0.8941 
## F-statistic:  1112 on 2 and 261 DF,  p-value: < 2.2e-16
vif(model5)
##  temperature soilmoisture 
##     3.653691     3.653691

We tried several model specifications: temperature alone; temperature and soil moisture; and temperature, soil moisture, and rainfall. In the three-variable model, rainfall was not significant (p = 0.175) and the VIF for soil moisture rose to 5.07, indicating variance inflation, so we kept temperature and soil moisture as the predictor set. The residual plot for that model showed a strongly curved residual mean. After applying a log transformation to the number of fires, the fit was much better: the Q-Q plot suggests the residuals are approximately normal, the model has an R-squared of 0.89, and the F-test gives a p-value < 2.2e-16. Both results suggest that monthly average temperature and monthly average soil moisture are strong predictors of the monthly number of wildfires.
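
As a rough reading of the log model, the fitted coefficients can be written out and converted into multiplicative effects. This assumes that `log()` is R's default natural logarithm and takes the estimates printed above at face value; the units of temperature and soil moisture are whatever `summary_nature` uses and are not restated here.

$$
\log(\widehat{\text{total}}) = 8.44582 + 0.05834\,\text{temperature} - 0.23920\,\text{soilmoisture}
$$

$$
e^{0.05834} \approx 1.060, \qquad e^{-0.23920} \approx 0.787
$$

Holding the other predictor fixed, a one-unit increase in temperature is associated with roughly 6% more fires per month, and a one-unit increase in soil moisture with roughly 21% fewer. These are associations within the fitted model, not causal estimates. Because the model has only two predictors, the reported VIF also implies how strongly they are correlated with each other:

$$
\text{VIF} = \frac{1}{1 - r^2} = 3.653691 \;\Rightarrow\; r^2 \approx 0.726, \quad |r| \approx 0.85
$$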

Chapter 8: Conclusion

After exploring the data and conducting our tests, we reached several conclusions. Air temperature and rainfall did not seem to have much effect on the surface when viewed over the years, so we split the wildfires into groups based on class and cause. Comparing Class A (smaller) with Class G (larger) fires, Class A fires are much more frequent and occur when air temperature and rainfall are lower and soil moisture is higher. This conclusion, however, seems fairly elementary, so we looked at wildfires by cause to investigate whether conditions varied between them. During lightning-caused fires, air temperature and rainfall were higher, and soil moisture was lower, than during fires caused by equipment use and arson. During arson-caused fires, air temperature and rainfall were lower, and soil moisture was higher, than during fires caused by equipment use. While arson is largely unpredictable and difficult to analyze, precautions could be taken when it is raining to lower the chances that lightning causes fires. For example, removing dry leaves and foliage would be a good start.

In addition, California's budget spending appears to have risen considerably over the last two decades. Still, the frequency and intensity of fires over the years have not changed. This raises the question: what exactly is the budget going towards, and where could some of it go instead? Had we been able to see how the budget was partitioned, we could have made recommendations based on our findings on where that funding should go. For now, we recommend that the state of California carefully evaluate how that money is being spent, find ways to keep the soil moist, and remove dry foliage from the ground.